[Core] Convert EngineCoreRequest to Request before reaching the engine core … #21329
Conversation
Code Review
This pull request refactors the request processing pipeline to improve performance by moving the conversion of EngineCoreRequest to Request off the main engine thread's critical path. The changes are well-structured and the performance gains are clearly demonstrated.
I've identified one potential high-severity issue: a race condition on the multi-modal input cache (mm_input_cache_server) introduced by accessing it from multiple threads without synchronization. I've provided details in a specific comment. Addressing this will ensure the thread safety of the new implementation.
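If the cache really can be hit from both the input thread and the engine core thread, one standard fix is to serialize access with a lock. A minimal sketch, with invented class and method names for illustration (this is not vLLM's actual mm_input_cache_server API):

```python
import threading


class ThreadSafeMMInputCache:
    """Toy stand-in for a multi-modal input cache accessed from
    multiple threads; all reads and writes go through one lock."""

    def __init__(self) -> None:
        self._cache: dict[str, object] = {}
        self._lock = threading.Lock()

    def get_or_put(self, key: str, value: object) -> object:
        # A single critical section avoids the check-then-act race
        # between the input thread and the engine core thread.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = value
            return self._cache[key]


cache = ThreadSafeMMInputCache()
print(cache.get_or_put("img-0", "embedding-A"))  # stores and returns "embedding-A"
print(cache.get_or_put("img-0", "embedding-B"))  # returns the cached "embedding-A"
```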
Nice idea! I wonder what else we can put into that thread...
Block hashes for sure, as we discussed in a separate PR. Personally, the main benefit of this change is to come up with a centralized place to put all these future ideas :)
vllm/v1/request.py (Outdated)
Instead of adding this to the internal Request, maybe convert EngineCoreRequest to a tuple[Request, int]?
Hmmm, I might be leaning toward adding it to Request instead:
- current_wave is an existing field in EngineCoreRequest
- If we go with tuple[Request, int], I'm afraid we might end up with tuple[Request, A, B, C, D, ...] in the future :/

WDYT?
current_wave is logically separate from the request; it's only used for coordination purposes at the point that the request is received. Request is the scheduler's state for the request, so it doesn't really belong in there. So I don't think what you mentioned will be a concern.
How about wrapping current_wave in a new class (e.g. RequestEnv)? The interface would then become tuple[Request, RequestEnv]. In this way,
- non-request data (i.e. current_wave) go into RequestEnv
- when similar new fields come in, we have a place for them without touching the interface

WDYT?
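A rough sketch of the wrapper proposal, with simplified stand-in types (this is not vLLM's actual Request class or conversion signature):

```python
from dataclasses import dataclass


@dataclass
class RequestEnv:
    # Coordination data that accompanies a request but is not
    # part of the scheduler's per-request state.
    current_wave: int


@dataclass
class Request:
    # Minimal stand-in for vllm.v1.request.Request.
    request_id: str


def convert(request_id: str, current_wave: int) -> tuple[Request, RequestEnv]:
    # Hypothetical conversion returning the request plus its env,
    # so future non-request fields extend RequestEnv rather than
    # growing the tuple.
    return Request(request_id), RequestEnv(current_wave)


req, env = convert("req-0", 3)
```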
Ditto, gentle nudge @njhill for your thoughts :)
A new class would be more overhead; tuples are very cheap (small allocs are reused). I don't think we have to worry about a place for other values: it's unlikely there will be any, and it's better to add that if/when needed in the future. This isn't an external interface.
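The tuple-returning conversion suggested here could look roughly like this (simplified stand-in types, not vLLM's actual Request or conversion function):

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Minimal stand-in for vllm.v1.request.Request.
    request_id: str


def from_engine_core_request(request_id: str, current_wave: int) -> tuple[Request, int]:
    # Sketch of the tuple approach: current_wave rides alongside
    # the Request instead of becoming a field on it, keeping the
    # scheduler-side state free of coordination-only data.
    return Request(request_id), current_wave


req, wave = from_engine_core_request("req-0", 2)
```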
@njhill Thanks for your inputs.
I've handed the idea over to @linzebing, and most of your comments should have been addressed there. Let's move the discussion to #21627.
vllm/v1/engine/core_client.py (Outdated)
Where else are we calling add_request that we need to keep a union of 2 types here?
At least 2 places I noticed. Especially for the latter one, I'm not fully sure if we need to further update the interface upstream.

vllm/v1/engine/llm_engine.py, line 212 in e7b2042:

```python
self.engine_core.add_request(request)
```

vllm/v1/engine/core_client.py, lines 606 to 609 in e7b2042:

```python
def add_request(self, request: EngineCoreRequest) -> None:
    if self.is_dp:
        self.engines_running = True
    self._send_input(EngineCoreRequestType.ADD, request)
```
An alternative approach is to:
- Add a non-Union API: add_request(self, request: Request)
- Expose the EngineCoreRequest-to-Request conversion as an API (e.g. preprocess_add_request(EngineCoreRequest) -> Request), and update the logic on the caller side

WDYT?
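This alternative could be sketched as follows. All the types and the helper name here are simplified illustrations of the proposal, not vLLM's actual interfaces:

```python
from dataclasses import dataclass


@dataclass
class EngineCoreRequest:
    # Stand-in for the wire-format request.
    request_id: str
    prompt_token_ids: list[int]


@dataclass
class Request:
    # Stand-in for the scheduler-side request.
    request_id: str
    num_prompt_tokens: int


def preprocess_add_request(req: EngineCoreRequest) -> Request:
    # Callers do the conversion up front, so add_request only
    # ever sees Request and needs no Union in its signature.
    return Request(req.request_id, len(req.prompt_token_ids))


class EngineCoreClient:
    def __init__(self) -> None:
        self.requests: list[Request] = []

    def add_request(self, request: Request) -> None:
        # Non-Union signature: conversion already happened upstream.
        self.requests.append(request)


client = EngineCoreClient()
client.add_request(preprocess_add_request(EngineCoreRequest("r0", [1, 2, 3])))
```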
Those should be for the sync engine. We should be able to trigger them directly using the pythonic API or bench throughput. Do we see similar benefits?
This pull request has merge conflicts that must be resolved before it can be merged.
Handed over to @linzebing in #21627
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
In the engine core thread, we spent 18us converting EngineCoreRequest to Request, which is on the model forward critical path.
Ideally, we should be able to move the conversion from the engine core thread to the input request thread, taking this logic off the critical path.
There's an extra benefit to making the change: as Request becomes available in the input processing threads, it would significantly simplify the pending changes for the block hashing optimization mentioned in #21247.
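The idea can be sketched with a toy producer thread. The queue names and the convert stub here are hypothetical; vLLM's actual input thread is considerably more involved:

```python
import queue
import threading

input_queue: "queue.Queue" = queue.Queue()
ready_queue: "queue.Queue" = queue.Queue()


def convert(raw: dict) -> dict:
    # Stand-in for the EngineCoreRequest -> Request conversion
    # (~18us per request in the profile above).
    return {"request_id": raw["request_id"], "converted": True}


def input_thread() -> None:
    # Conversion happens here, overlapping with model forward,
    # instead of blocking the engine core loop.
    while True:
        raw = input_queue.get()
        if raw is None:  # sentinel: shut down
            break
        ready_queue.put(convert(raw))


t = threading.Thread(target=input_thread, daemon=True)
t.start()
input_queue.put({"request_id": "r0"})
input_queue.put(None)
t.join()

result = ready_queue.get()
print(result)  # {'request_id': 'r0', 'converted': True}
```

The engine core loop would then pop already-converted Request objects from the ready queue, paying only the dequeue cost on the critical path.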
Test Plan
Profile with the change
Test Result
With the change, handle_client_request in the engine core thread dropped from 35us to 7us.

As expected, Request conversion now executes in parallel with model forward.

(Optional) Documentation Update